Spelling Normalization of Historical German with Sparse Training Data

Author

  • Marcel Bollmann
Abstract

Recently, there has been a growing interest in historical language corpora. Projects to create such corpora exist for a variety of languages such as German (Scheible et al. 2011), Spanish (Sánchez-Marco et al. 2010), or Slovene (Erjavec 2012). Annotation of these corpora is complicated by the fact that specialized tools for these language stages are typically not available. A common approach is to employ spelling normalization to map historical wordforms to modern ones (e.g., Adesam et al. 2012, Baron et al. 2009, Jurish 2010), so that existing tools for the modern language (e.g., modern POS taggers) can be used on the normalized data. This paper presents an approach to spelling normalization that combines three different normalization algorithms and evaluates it on a diverse set of texts of historical German. The evaluation shows that this approach produces acceptable results even with comparatively small amounts of training data. The normalization methods were previously described in Bollmann (2012), though with a much more restricted evaluation.
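The abstract does not name the three algorithms or say how they are combined. Purely as an illustration of the general idea, the following minimal sketch chains three hypothetical normalizers (a lookup of learned word pairs, character rewrite rules checked against a lexicon, and a string-similarity search) and falls back to the next one whenever the previous one yields nothing. The word lists, rules, function names, and the use of Python's `difflib` are all assumptions made for this example and are not taken from the paper.

```python
from difflib import get_close_matches

# Hypothetical toy data for illustration only (not from the paper).
TRAINING_PAIRS = {"vnd": "und", "jn": "in"}        # historical -> modern pairs
MODERN_LEXICON = ["und", "in", "jahr", "teil", "frau"]
REWRITE_RULES = [("v", "u"), ("th", "t")]           # naive character rewrites


def by_wordlist(word):
    """Return the modern form seen in training, or None if unseen."""
    return TRAINING_PAIRS.get(word)


def by_rules(word):
    """Apply every rewrite rule; accept the result only if it is a known modern word."""
    candidate = word
    for old, new in REWRITE_RULES:
        candidate = candidate.replace(old, new)
    return candidate if candidate in MODERN_LEXICON else None


def by_similarity(word):
    """Return the most similar lexicon entry above a similarity cutoff, or None."""
    matches = get_close_matches(word, MODERN_LEXICON, n=1, cutoff=0.7)
    return matches[0] if matches else None


def normalize(word):
    """Try each normalizer in order; leave the word unchanged if all of them fail."""
    lowered = word.lower()
    for normalizer in (by_wordlist, by_rules, by_similarity):
        result = normalizer(lowered)
        if result is not None:
            return result
    return word


if __name__ == "__main__":
    for token in ["vnd", "theil", "jar", "xyzzy"]:
        print(token, "->", normalize(token))
```

Ordering the normalizers from most to least precise is one plausible design choice: an exact lookup is trusted first, and the fuzzier similarity search is used only as a last resort. Other combinations, such as voting or scoring candidates, are equally conceivable.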

Similar resources

POS Tagging for Historical Texts with Sparse Training Data

This paper presents a method for part-of-speech tagging of historical data and evaluates it on texts from different corpora of historical German (15th–18th century). Spelling normalization is used to preprocess the texts before applying a POS tagger trained on modern German corpora. Using only 250 manually normalized tokens as training data, the tagging accuracy of a manuscript from the 15th cen...

Manual and semi-automatic normalization of historical spelling - case studies from Early New High German

This paper presents work on manual and semi-automatic normalization of historical language data. We first address the guidelines that we use for mapping historical to modern word forms. The guidelines distinguish between normalization (preferring forms close to the original) and modernization (preferring forms close to modern language). Average inter-annotator agreement is 88.38% on a set of da...

Improving historical spelling normalization with bi-directional LSTMs and multi-task learning

Natural-language processing of historical documents is complicated by the abundance of variant spellings and lack of annotated data. A common approach is to normalize the spelling of historical words to modern forms. We explore the suitability of a deep neural network architecture for this task, particularly a deep bi-LSTM network applied on a character level. Our model compares well to previou...

Normalizing Medieval German Texts: from rules to deep learning

The application of NLP tools to historical texts is complicated by a high level of spelling variation. Different methods of historical text normalization have been proposed. In this comparative evaluation I test the following three approaches to text canonicalization on historical German texts from the 15th–16th centuries: rule-based, statistical machine translation, and neural machine translation....

Using Comparable Collections of Historical Texts for Building a Diachronic Dictionary for Spelling Normalization

In this paper, we argue that comparable collections of historical written resources can help overcome typical challenges posed by heritage texts, enhancing spelling normalization, POS-tagging and subsequent diachronic linguistic analyses. Thus, we present a comparable corpus of historical German recipes and show how such a comparable text collection together with the application of innovative ...

Publication date: 2013